Windsor
- Europe > Austria > Vienna (0.14)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- North America > Canada > Alberta > Census Division No. 15 > Improvement District No. 9 > Banff (0.04)
- (6 more...)
Simulated Self-Assessment in Large Language Models: A Psychometric Approach to AI Self-Efficacy
Jackson, Daniel I, Jensen, Emma L, Hussain, Syed-Amad, Sezgin, Emre
Self-assessment is a key aspect of reliable intelligence, yet evaluations of large language models (LLMs) focus mainly on task accuracy. We adapted the 10-item General Self-Efficacy Scale (GSES) to elicit simulated self-assessments from ten LLMs across four conditions: no task, computational reasoning, social reasoning, and summarization. GSES responses were highly stable across repeated administrations and randomized item orders. However, models showed significantly different self-efficacy levels across conditions, with aggregate scores lower than human norms. All models achieved perfect accuracy on computational and social questions, whereas summarization performance varied widely. Self-assessment did not reliably reflect ability: several low-scoring models performed accurately, while some high-scoring models produced weaker summaries. Follow-up confidence prompts yielded modest, mostly downward revisions, suggesting mild overestimation in first-pass assessments. Qualitative analysis showed that higher self-efficacy corresponded to more assertive, anthropomorphic reasoning styles, whereas lower scores reflected cautious, de-anthropomorphized explanations. Psychometric prompting provides structured insight into LLM communication behavior but not calibrated performance estimates.
- North America > United States > Ohio > Franklin County > Columbus (0.04)
- Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
- North America > Costa Rica (0.04)
- (5 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Deep Hidden Cognition Facilitates Reliable Chain-of-Thought Reasoning
Chen, Zijun, Hu, Wenbo, Hong, Richang
Chain of Thought (CoT) reasoning has demonstrated remarkable deep reasoning capabilities in both large language models (LLMs) and multimodal large language models (MLLMs). However, its reliability is often undermined by the accumulation of errors in intermediate steps. This paper introduces an novel approach to calibrate the CoT reasoning accuracy by leveraging the model's intrinsic veracity encoding. We discover that specific attention head activations reliably reflect the truthfulness of reasoning steps in CoT. Based on this insight, we train a confidence predictor to evaluate the correctness of each reasoning step using these truthfulness-sensitive activations, dynamically selecting the most plausible reasoning path via beam search. Experimental results demonstrate that our method significantly outperforms the state-of-the-art baselines (e.g., Few-Shot CoT, Self-Consistency, and Self-Evaluation Guided Beam Search) across the mathematical, symbolic, and commonsense reasoning tasks, exhibiting superior accuracy and reliability in both unimodal and multimodal settings. We further validate the approach on large reasoning models, confirming its applicability to specialized reasoning models. Additionally, we explore the role of the model's self-correction ability in CoT reasoning. This work provides a novel reliability improvement path for CoT reasoning with broad application potential.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > China > Anhui Province > Hefei (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
- Europe > Austria > Vienna (0.14)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- (7 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.68)
- Information Technology > Artificial Intelligence > Cognitive Science (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Lifetime-Aware Design of Item-Level Intelligence
Prakash, Shvetank, Cheng, Andrew, Kindgren, Olof, Ahamed, Ashiq, Knight, Graham, Kufel, Jed, Rodriguez, Francisco, Tschand, Arya, Kong, David, Elgamal, Mariam, Huang, Jerry, Chen, Emma, Hills, Gage, Price, Richard, Ozer, Emre, Reddi, Vijay Janapa
We present FlexiFlow, a lifetime-aware design framework for item-level intelligence (ILI) where computation is integrated directly into disposable products like food packaging and medical patches. Our framework leverages natively flexible electronics which offer significantly lower costs than silicon but are limited to kHz speeds and several thousands of gates. Our insight is that unlike traditional computing with more uniform deployment patterns, ILI applications exhibit 1000X variation in operational lifetime, fundamentally changing optimal architectural design decisions when considering trillion-item deployment scales. To enable holistic design and optimization, we model the trade-offs between embodied carbon footprint and operational carbon footprint based on application-specific lifetimes. The framework includes: (1) FlexiBench, a workload suite targeting sustainability applications from spoilage detection to health monitoring; (2) FlexiBits, area-optimized RISC-V cores with 1/4/8-bit datapaths achieving 2.65X to 3.50X better energy efficiency per workload execution; and (3) a carbon-aware model that selects optimal architectures based on deployment characteristics. We show that lifetime-aware microarchitectural design can reduce carbon footprint by 1.62X, while algorithmic decisions can reduce carbon footprint by 14.5X. We validate our approach through the first tape-out using a PDK for flexible electronics with fully open-source tools, achieving 30.9kHz operation. FlexiFlow enables exploration of computing at the Extreme Edge where conventional design methodologies must be reevaluated to account for new constraints and considerations.
- Asia > India (0.46)
- North America > United States > New York > New York County > New York City (0.14)
- Europe > Germany (0.04)
- (8 more...)
LSH-DynED: A Dynamic Ensemble Framework with LSH-Based Undersampling for Evolving Multi-Class Imbalanced Classification
The classification of imbalanced data streams, which have unequal class distributions, is a key difficulty in machine learning, especially when dealing with multiple classes. While binary imbalanced data stream classification tasks have received considerable attention, only a few studies have focused on multi-class imbalanced data streams. Effectively managing the dynamic imbalance ratio is a key challenge in this domain. This study introduces a novel, robust, and resilient approach to address these challenges by integrating Locality Sensitive Hashing with Random Hyperplane Projections (LSH-RHP) into the Dynamic Ensemble Diversification (DynED) framework. To the best of our knowledge, we present the first application of LSH-RHP for undersampling in the context of imbalanced non-stationary data streams. The proposed method undersamples the majority classes by utilizing LSH-RHP, provides a balanced training set, and improves the ensemble's prediction performance. We conduct comprehensive experiments on 23 real-world and ten semi-synthetic datasets and compare LSH-DynED with 15 state-of-the-art methods. The results reveal that LSH-DynED outperforms other approaches in terms of both Kappa and mG-Mean effectiveness measures, demonstrating its capability in dealing with multi-class imbalanced non-stationary data streams. Notably, LSH-DynED performs well in large-scale, high-dimensional datasets with considerable class imbalances and demonstrates adaptation and robustness in real-world circumstances. To motivate our design, we review existing methods for imbalanced data streams, outline key challenges, and offer guidance for future work. For the reproducibility of our results, we have made our implementation available on GitHub.
- Oceania > Australia (0.04)
- North America > United States > Kansas > Riley County > Manhattan (0.04)
- Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)
- (8 more...)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.66)
- Education > Educational Setting > Online (1.00)
- Health & Medicine (0.67)
CTSketch: Compositional Tensor Sketching for Scalable Neurosymbolic Learning
Choi, Seewon, Solko-Breslin, Alaia, Alur, Rajeev, Wong, Eric
Many computational tasks benefit from being formulated as the composition of neural networks followed by a discrete symbolic program. The goal of neurosymbolic learning is to train the neural networks using only end-to-end input-output labels of the composite. We introduce CTSketch, a novel, scalable neurosymbolic learning algorithm. CTSketch uses two techniques to improve the scalability of neurosymbolic inference: decompose the symbolic program into sub-programs and summarize each sub-program with a sketched tensor. This strategy allows us to approximate the output distribution of the program with simple tensor operations over the input distributions and summaries. We provide theoretical insight into the maximum error of the approximation. Furthermore, we evaluate CTSketch on many benchmarks from the neurosymbolic literature, including some designed for evaluating scalability. Our results show that CTSketch pushes neurosymbolic learning to new scales that have previously been unattainable by obtaining high accuracy on tasks involving over one thousand inputs.
- North America > United States > Pennsylvania (0.04)
- Africa > Senegal > Kolda Region > Kolda (0.04)
- Europe > United Kingdom > England > Berkshire > Windsor (0.04)
Performance-bounded Online Ensemble Learning Method Based on Multi-armed bandits and Its Applications in Real-time Safety Assessment
Hu, Songqiao, Liu, Zeyi, He, Xiao
--Ensemble learning plays a crucial role in practical applications of online learning due to its enhanced classification performance and adaptable adjustment mechanisms. However, most weight allocation strategies in ensemble learning are heuristic, making it challenging to theoretically guarantee that the ensemble classifier outperforms its base classifiers. T o address this issue, a performance-bounded online ensemble learning method based on multi-armed bandits, named PB-OEL, is proposed in this paper . Specifically, multi-armed bandit with expert advice is incorporated into online ensemble learning, aiming to update the weights of base classifiers and make predictions. A theoretical framework is established to bound the performance of the ensemble classifier relative to base classifiers. By setting expert advice of bandits, the bound exceeds the performance of any base classifier when the length of data stream is sufficiently large. Additionally, performance bounds for scenarios with limited annotations are also derived. Numerous experiments on benchmark datasets and a dataset of real-time safety assessment tasks are conducted. The experimental results validate the theoretical bound to a certain extent and demonstrate that the proposed method outperforms existing state-of-the-art methods. Index T erms --Online ensemble learning, performance bound, multi-armed bandits, concept drift, real-time safety assessment. NLINE learning (OL) holds significant potential for handling continuous data and is widely applied across various domains, including industry, recommendation systems, finance, and control systems [1]-[5]. The objective of OL is to continuously learn and update models from new data, enabling adaptation to non-stationary environments for optimized predictions or decisions. One mainstream idea in OL relies on maintaining a set of vectors for decision, as exemplified by the perceptron algorithm [6], passive-aggressive algorithm [7], confidence weighted-based algorithm [8] and imbalanced class weighted-based algorithm [9].
- Asia > China > Beijing > Beijing (0.04)
- South America > Paraguay > Asunción > Asunción (0.04)
- North America > United States > California (0.04)
- (3 more...)
- Research Report (1.00)
- Instructional Material > Online (1.00)
- Information Technology > Data Science > Data Mining > Big Data (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.54)
Federated Continual Instruction Tuning
Guo, Haiyang, Zeng, Fanhu, Zhu, Fei, Liu, Wenzhuo, Wang, Da-Han, Xu, Jian, Zhang, Xu-Yao, Liu, Cheng-Lin
A vast amount of instruction tuning data is crucial for the impressive performance of Large Multimodal Models (LMMs), but the associated computational costs and data collection demands during supervised fine-tuning make it impractical for most researchers. Federated learning (FL) has the potential to leverage all distributed data and training resources to reduce the overhead of joint training. However, most existing methods assume a fixed number of tasks, while in real-world scenarios, clients continuously encounter new knowledge and often struggle to retain old tasks due to memory constraints. In this work, we introduce the Federated Continual Instruction Tuning (FCIT) benchmark to model this real-world challenge. Our benchmark includes two realistic scenarios, encompassing four different settings and twelve carefully curated instruction tuning datasets. To address the challenges posed by FCIT, we propose dynamic knowledge organization to effectively integrate updates from different tasks during training and subspace selective activation to allocate task-specific output during inference. Extensive experimental results demonstrate that our proposed method significantly enhances model performance across varying levels of data heterogeneity and catastrophic forgetting. Our source code and dataset will be made publicly available.
- North America > United States > Virginia (0.04)
- Europe > United Kingdom > England > Berkshire > Windsor (0.04)
- Information Technology > Security & Privacy (0.68)
- Health & Medicine (0.46)
Mutation-Guided LLM-based Test Generation at Meta
Foster, Christopher, Gulati, Abhishek, Harman, Mark, Harper, Inna, Mao, Ke, Ritchey, Jillian, Robert, Hervé, Sengupta, Shubho
This paper describes Meta's ACH system for mutation-guided LLM-based test generation. ACH generates relatively few mutants (aka simulated faults), compared to traditional mutation testing. Instead, it focuses on generating currently undetected faults that are specific to an issue of concern. From these currently uncaught faults, ACH generates tests that can catch them, thereby `killing' the mutants and consequently hardening the platform against regressions. We use privacy concerns to illustrate our approach, but ACH can harden code against {\em any} type of regression. In total, ACH was applied to 10,795 Android Kotlin classes in 7 software platforms deployed by Meta, from which it generated 9,095 mutants and 571 privacy-hardening test cases. ACH also deploys an LLM-based equivalent mutant detection agent that achieves a precision of 0.79 and a recall of 0.47 (rising to 0.95 and 0.96 with simple pre-processing). ACH was used by Messenger and WhatsApp test-a-thons where engineers accepted 73% of its tests, judging 36% to privacy relevant. We conclude that ACH hardens code against specific concerns and that, even when its tests do not directly tackle the specific concern, engineers find them useful for their other benefits.
- Europe > Switzerland > Zürich > Zürich (0.14)
- Europe > Norway > Central Norway > Trøndelag > Trondheim (0.05)
- South America > Brazil (0.04)
- (15 more...)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)